Abstract

Operating on hash values of the elements of some database is a common problem. This paper analyzes the informational content of this general task and how to practically approach the resulting lower bounds. The minimal prefix tree which distinguishes the elements turns out to require asymptotically only about 2.77544 bits per element, while standard approaches use a few times more. When we are certain of working inside the database, the cost of distinguishability can be reduced further to about 2.33275 bits per element. Increasing the minimal depth of nodes to reduce the probability of false positives leads to a simple relation with the average depth of such a random tree, which is asymptotically larger by about 1.33275 bits than lg(n) of the perfect binary tree. This asymptotic case can also be seen as a way to optimally encode n large unordered numbers - saving lg(n!) bits of information about their ordering, which can be the major part of the contained information. This ability alone allows memory requirements to be reduced to as little as ln(2) ≈ 0.693 of those of a Bloom filter with the same false positive probability.

1 Introduction 

A frequently considered problem is counting the size of some discrete family. For example, the number of full binary trees (all nodes have 0 or 2 children) with n leaves is counted by the Catalan numbers ([1]), $C_n = \binom{2n}{n}/(n+1)$, which grow asymptotically as $4^n/(n^{3/2}\sqrt{\pi})$. Choosing one of m elements requires lg(m) bits of information ($\lg \equiv \log_2$), so the minimal average amount of information to choose one of such trees is asymptotically 2 bits per leaf.
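The "2 bits per leaf" figure can be checked numerically. The following is a small sketch, not part of the original analysis; the helper names `catalan` and `bits_per_leaf` are ours.

```python
from math import comb, log2

def catalan(n: int) -> int:
    """n-th Catalan number, C_n = binom(2n, n) / (n + 1) (exact integer arithmetic)."""
    return comb(2 * n, n) // (n + 1)

def bits_per_leaf(n: int) -> float:
    """lg(C_n) / n: average bits per leaf needed to choose one tree of the family."""
    return log2(catalan(n)) / n

for n in (10, 100, 1000, 10000):
    print(n, round(bits_per_leaf(n), 4))   # slowly approaches 2 from below
```

The convergence is slow because of the $n^{3/2}$ factor in the asymptotic formula, which contributes about $1.5\lg(n)/n$ bits per leaf.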

The situation usually improves if the probability distribution of elements is nonuniform: if it is $\{p_i\}_{i=1..n}$, the average number of bits per element required to encode such proportions of elements is the Shannon entropy:

$$\sum_{i=1}^{n} p_i \lg(1/p_i) \le \lg(n)$$

where the equality holds only for the uniform distribution. This formula can be read as follows: we need $\lg(1/p_i)$ bits of information to encode a choice/symbol of probability $p_i$, so on average we need the Shannon entropy bits of information per choice. This informational capacity can be approached as closely as we need using an entropy coder, like Arithmetic Coding [2] operating on two states (boundaries of a range), or the recent single-state Asymmetric Numeral Systems [3], in which encoding a symbol of probability $p_i$ increases the state about $1/p_i$ times.

Figure 1: Calculation of average entropy ($H_2$), average depth ($D_2$) and false positive probability ($F_2$, the probability of accidentally getting to a leaf) of the infinite prefix tree family with two leaves.
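As a minimal illustration of the entropy formula (a sketch with made-up example distributions, not taken from the paper):

```python
from math import log2

def shannon_entropy(probs):
    """Average bits per symbol, sum_i p_i * lg(1/p_i); zero-probability symbols contribute 0."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

uniform = [0.25] * 4                  # hypothetical 4-symbol alphabet
skewed = [0.5, 0.25, 0.125, 0.125]    # same alphabet, nonuniform probabilities

print(shannon_entropy(uniform))   # equals lg(4)
print(shannon_entropy(skewed))    # strictly below lg(4)
```

The nonuniform distribution needs fewer bits per symbol on average, which is the improvement the paragraph above refers to.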

Using probabilities while encoding becomes crucial if the set of possibilities is infinite. Even with entropy coding, for any length there are elements requiring more bits than that to store. However, their probabilities may drop quickly enough that the average number of required bits remains finite. In Fig. 1 we can see such a situation for the space of minimal prefix trees required to distinguish between some bit sequences, with each bit chosen from the uniform distribution.

Such a prefix tree is a natural representation when considering the hash function of elements of a dictionary/database - a deterministically chosen pseudorandom bit sequence assigned to each element. These sequences usually have fixed length, but we will not use this assumption here - we can imagine that initially they are infinite, then they are cut to an individually chosen optimal finite length prefix: the minimal one that distinguishes the sequence from the others.

So let us assume that there are n infinite sequences of independently chosen bits with P(0) = P(1) = 1/2 probability and we build the minimal prefix tree distinguishing these sequences, like in Fig. 2a. Imagine that someone has only such a tree. When asking for some element from the dictionary, he would always get the answer that it is in the tree, finding the corresponding leaf - a false negative is not possible. However, when asking for an element outside the dictionary, there is a nonzero probability of also getting to a leaf - false positives are possible. By elongating paths to the leaves to some fixed minimal depth like in Fig. 2b, we can use additional information about the sequences to reduce this probability as much as we need. Finally, increasing this depth to some fixed large number, we would almost surely just store fixed length bit sequences, but without information about their order - this saves lg(n!) bits of information about the order.
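The construction described above can be simulated directly. The sketch below is our own illustration: it truncates the "infinite" sequences to 64 random bits (an assumption that is harmless for small n) and uses the observation that a leaf's depth is one more than the longest prefix its sequence shares with any other sequence. The function names are hypothetical.

```python
import random

def leaf_depths(seqs, bits=64):
    """Leaf depth of every sequence in the minimal prefix tree: one more than
    the longest prefix it shares with any other sequence (0 for a single sequence)."""
    order = sorted(seqs)
    if len(order) < 2:
        return [0] * len(order)
    def shared(a, b):                   # common prefix length of two `bits`-bit integers
        return bits - (a ^ b).bit_length()
    depths = []
    for i, s in enumerate(order):
        # after sorting, the longest shared prefix is with a direct neighbour
        left = shared(s, order[i - 1]) if i > 0 else 0
        right = shared(s, order[i + 1]) if i + 1 < len(order) else 0
        depths.append(max(left, right) + 1)
    return depths

def simulate(n, trials=20000, seed=1, bits=64):
    """Monte Carlo estimate of (average leaf depth, false positive probability)."""
    rng = random.Random(seed)
    depth_sum = fp_sum = 0.0
    for _ in range(trials):
        depths = leaf_depths([rng.getrandbits(bits) for _ in range(n)], bits)
        depth_sum += sum(depths) / n
        fp_sum += sum(2.0 ** -d for d in depths)   # a random query reaches a depth-d leaf w.p. 2^-d
    return depth_sum / trials, fp_sum / trials

print(simulate(4))   # estimates of the average depth and false positive probability for n = 4
```

For n = 4 the estimates should land close to the values reported later for $D_4$ and $F_4$, up to sampling noise.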

We will see that the statistical ensemble of minimal prefix trees generated this way has asymptotically Shannon entropy of 2.77544 bits per element - this is the minimal average amount of information required to store such a tree. These minimal trees have a large false positive probability (about 0.721), but they can be safely used when we are certain of working within the corresponding dictionary. Having this certainty, we can reduce the tree further to about 2.33275 bits/element, increasing the false positive probability to 1. Attaching some additional information to the leaves of such a tree allows their type to be defined, for example to classify whether a given word is a noun or a verb while being certain of working within the given dictionary. In such a case, the minimal number of required bits per element is 2.33275 plus the attached information, like 1 bit to tell if the word is a verb or a noun (or less if they have unequal probabilities). The minimality of the used information can also be seen as an advantage for cryptographic purposes, reducing the cost of possible leaks of such information.

The considered data compression is rather impractical while operating on the tree, but it might be useful while storing or transmitting it. Additionally, these considerations and the known theoretical bounds may help in developing methods that handle memory in a more optimal way.

Besides the informational content, we will also find the average depth of such a random tree and relate the values of these properties. These relations lead to relatively compact formulas for some complicated recurrences. We will also find the probability of false positives and use it to compare this approach with the commonly used Bloom filter. Table 1 gathers the calculated properties for some of the more interesting parameters.

2 Entropy of minimal prefix tree 

The additivity of entropy allows us to divide the selection of an element into smaller choices. To encode such a randomly generated prefix tree, we can divide the choice of the tree into situations in single nodes. It is essential to calculate the probabilities of choices properly; in that case all representations have to lead to the same entropy. There are plenty of possibilities, like just storing whether the node has no, left, right or both children.

For convenience we will use a different representation here: for each node, store the distribution of sequences between its left and right child. Specifically, assume that for a given node we know the total number (n) of sequences going through it (leaves in its subtree). Now for this node we will only encode how many of these sequences make the next step left ($k \in \{0, 1, .., n\}$) - the rest of them (n - k) go right. We can then recursively repeat this process for both children, knowing the total number of leaves in their subtrees. Imagining bit sequences as infinite binary expansions of numbers from the [0, 1] range, the first choice is how many of these numbers go to the [0, 1/2] half, then we recursively go into the two subranges.
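The representation described above can be sketched as follows. This is an illustration under our own assumptions: sequences are given as bit strings long enough to be pairwise distinguishable, nodes are visited in preorder, and each split count k is charged its ideal cost $\lg(2^n/\binom{n}{k})$ under the binomial model. The function names and the example prefixes are hypothetical.

```python
from math import comb, log2

def split_counts(seqs, depth=0, out=None):
    """Preorder list of (n, k): n sequences reach a node, k of them step left (bit '0').
    Recursion stops at nodes reached by fewer than two sequences.
    Assumes the bit strings are pairwise distinct and long enough to be separated."""
    if out is None:
        out = []
    n = len(seqs)
    if n < 2:
        return out
    left = [s for s in seqs if s[depth] == "0"]
    right = [s for s in seqs if s[depth] == "1"]
    out.append((n, len(left)))
    split_counts(left, depth + 1, out)
    split_counts(right, depth + 1, out)
    return out

def ideal_bits(splits):
    """Ideal code length: each k costs lg(2^n / binom(n, k)) bits under the binomial model."""
    return sum(n - log2(comb(n, k)) for n, k in splits)

seqs = ["0010", "0111", "1000", "1011"]   # hypothetical 4-bit hash prefixes
splits = split_counts(seqs)
print(splits)
print(round(ideal_bits(splits), 3))
```

An entropy coder fed with these binomial probabilities would approach `ideal_bits` for this particular tree; averaging that cost over random trees gives the entropy discussed next.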

There is a problem with the root of the tree - the total number of elements has to be 
written separately. This issue will not be addressed here, but storing a natural number takes 
about lg(n) bits of information, so this cost divided by the number of elements vanishes 
asymptotically. 

Figure 2: Cases we will consider: a) minimal prefix tree (for n = 4, $H_4 = 7.986$ bits on average), b) minimal depth d prefix tree (10.174 bits on average), c) asymptotic behavior for large d (the total is the prefix tree, $H_4$ bits on average with average depth $D_4 = 3.143$, plus suffixes) and d) reduced minimal prefix tree ($H'_4 = 6.224$ bits on average).

Denote by $H_n$ the average number of bits required to encode the minimal prefix tree for n bit sequences (leaves), whose bits were chosen independently at random with the P(0) = P(1) = 1/2 probability distribution. Connecting the node with its two children we get the recurrence:

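$$H_n = h_n + \sum_{k=0}^{n} \frac{\binom{n}{k}}{2^n}\left(H_k + H_{n-k}\right) = h_n + 2\sum_{k=0}^{n-1} \frac{\binom{n}{k}}{2^n} H_k + 2\,\frac{H_n}{2^n}$$

where $h_n := -\sum_{k=0}^{n} \frac{\binom{n}{k}}{2^n} \lg\left(\binom{n}{k}/2^n\right)$.

Subtracting $2H_n/2^n$ from both sides and then dividing by $(1 - 1/2^{n-1})$ we finally get:

$$H_0 = H_1 = 0, \qquad H_n = \frac{2^{n-1} h_n + \sum_{k=0}^{n-1} \binom{n}{k} H_k}{2^{n-1} - 1} \quad \text{for } n \ge 2 \qquad (2)$$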

For a node through which n sequences go, $h_n$ is the average amount of information required to choose how many of them make the next step left. Practical encoding of the tree using this approach requires going through all internal nodes in some order (e.g. preorder) and encoding this information for each of them (and possibly some information stored in leaves).
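Recurrence (2) is easy to evaluate numerically. The sketch below (our own helper names; plain floating point, so intended for moderate n only) computes $h_n$ as the entropy of the Binomial(n, 1/2) split and then fills the table of $H_n$.

```python
from math import comb, log2

def h(n):
    """Entropy (bits) of the Binomial(n, 1/2) count k of sequences stepping left."""
    return -sum(comb(n, k) / 2**n * log2(comb(n, k) / 2**n) for k in range(n + 1))

def H_table(n_max):
    """H[n] for n = 0..n_max from recurrence (2), with H[0] = H[1] = 0."""
    H = [0.0, 0.0]
    for n in range(2, n_max + 1):
        total = 2 ** (n - 1) * h(n) + sum(comb(n, k) * H[k] for k in range(n))
        H.append(total / (2 ** (n - 1) - 1))
    return H

H = H_table(10)
for n in (2, 3, 4, 10):
    print(n, round(H[n], 3))
```

The printed values can be compared with the $H_n$ row of Table 1 and with the value quoted for $H_4$ in Fig. 2.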

In practice, such a single choice of one of n + 1 possibilities can be divided into a few smaller choices, like whether k < n/2 as the first one. Using binary choices and choosing the divisions to make probabilities near 1/2 would allow these choices to be encoded directly as bits of information in a tree created this way, e.g. a Huffman tree. However, these probabilities are not exactly powers of 2, so this way we would move away from the theoretical informational capacity. Using a precise entropy coder instead, like Arithmetic Coding or Asymmetric Numeral Systems, allows us to get as near the calculated bound as we want.

The recurrence (2) for $H_n$ seems to be very difficult to solve. We will find a more analytical formula later by relating it to the average depth. Let us now find an approximation of it. The probabilities in the $h_n$ formula are nearly as for the Gaussian distribution - we can approximate it as:

$$h_n \approx \tilde{h}_n := -\int_{-\infty}^{\infty} \rho_{n/2,\sqrt{n/4}}(x)\, \lg\left(\rho_{n/2,\sqrt{n/4}}(x)\right) dx = \frac{1}{2} \lg\left(\frac{\pi e}{2}\, n\right) \qquad (3)$$

where $\rho_{\mu,\sigma}(x) := \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ is the Gaussian distribution probability density.

Figure 3: Oscillations of $H_n$ and $D_n$ using different approximations.

The next approximation uses the fact that the Gaussian distribution for large n is almost completely concentrated in the center, leading to a much simpler recurrence which can be easily solved:

$$\tilde{H}_n = a\,n - 1 - \frac{1}{2}\lg\left(\frac{e\pi}{2}\right) - \frac{1}{2}\lg(n) \qquad (4)$$

The recurrence for $\tilde{H}_n$ leaves freedom in choosing the most important parameter here: a, which is the average number of bits per element in this nearly linear relation. We can fit it to the numerical solutions of (2) for now, but later we will see a suggestion to choose this average cost of distinguishability as $a = \frac{1}{2} + (1 + \gamma)\lg(e) \approx 2.77544121816583$.
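As a check of (4), assume the simplified recurrence has the form $\tilde{H}_n = \tilde{h}_n + 2\tilde{H}_{n/2}$. This form is our reading of "concentrated in the center" (both children receive about n/2 sequences); the text does not state the recurrence explicitly. Substituting (4) and $\tilde{h}_n = \frac12\lg(\frac{e\pi}{2}) + \frac12\lg(n)$ from (3):

```latex
\begin{align*}
\tilde h_n + 2\tilde H_{n/2}
 &= \tfrac12\lg\!\big(\tfrac{e\pi}{2}\big) + \tfrac12\lg(n)
    + 2\Big(a\,\tfrac n2 - 1 - \tfrac12\lg\!\big(\tfrac{e\pi}{2}\big) - \tfrac12\lg\!\big(\tfrac n2\big)\Big) \\
 &= a\,n - 2 - \tfrac12\lg\!\big(\tfrac{e\pi}{2}\big) + \tfrac12\lg(n) - \lg(n) + 1 \\
 &= a\,n - 1 - \tfrac12\lg\!\big(\tfrac{e\pi}{2}\big) - \tfrac12\lg(n) \;=\; \tilde H_n .
\end{align*}
```

The identity holds for every value of a, which is consistent with the remark that the recurrence alone leaves this parameter free.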

$\tilde{H}_n$ seems to approximate $H_n$ well, but surprisingly some small regular oscillations appear in their difference, as can be seen in Fig. 3. Looking at the functional equation $f(x) = 2f(x/2)$ (nearly as for $H_n$), we can understand the origin of this difference - an additional growing oscillating solution $f(x) = x\sin(2\pi\lg(x))$ can appear on the positive half axis. Finally, $\tilde{H}_n$ extended by such a term,

$$\tilde{H}_n := a\,n - 1 - \frac{1}{2}\lg(e\pi/2) - \frac{1}{2}\lg(n) + \epsilon\, n \sin(2\pi\lg(n) + \delta) \qquad (5)$$

approximates $H_n$ closely for larger n, for fitted $\epsilon$, $\delta$:

$$a = 2.7754412 \qquad \epsilon = 1.6 \cdot 10^{-6} \qquad \delta = 0.88 \qquad (6)$$

Because of the last, growing oscillating term, the average number of bits of information per element asymptotically stays in a range of width $2\epsilon$: [2.7754396, 2.7754428].

3 Minimal depth prefix tree 

The minimal prefix tree allows us to distinguish between elements, for example to attach some information to them. However, if we are interested in the question of whether a given object is in the dictionary, such a tree would often falsely give a positive answer for a random element (with about 0.721 probability, as we will see later). We can reduce this false positive probability by elongating leaves further as in Fig. 2b,c - at the cost of storing these suffixes up to some chosen depth d. We assume that these sequences are completely random, so it costs 1 bit per node.

While calculating the average informational content of such a minimal depth d prefix tree, the only change from the minimal prefix tree is the cost of leaves - now it grows from 0 to the remaining depth. Expressing it in the language of recurrence:

$$H^0_n = H_n, \quad H^d_1 = d, \quad \text{for } d \ge 1,\ n \ge 2: \quad H^d_n = h_n + \sum_{k=1}^{n} \frac{\binom{n}{k}}{2^{n-1}}\, H^{d-1}_k \qquad (7)$$

For growing n, leaves will statistically get over this boundary: $\lim_{n\to\infty} H^d_n - H_n = 0$.

This recurrence differs from the one for $H_n$ only by $H^d_1 = d$ instead of 0. We can trace the increase of the original $H_n$ which came from such an additional 1 from depth d to write:

$$H^d_n = H_n + L^{d-1}_n + 2L^{d-2}_n + \ldots + (d-1)L^1_n = H_n + \sum_{i=1}^{d-1} (d-i)\, L^i_n \qquad (8)$$

where this L fulfills the recurrence (with $L^d_1 = 0$ for $d \ge 1$):

$$L^0_1 = 1, \quad \forall_{n>1}\ L^0_n = 0, \quad \text{for } d \ge 1,\ n > 1: \quad L^d_n = \sum_{k=1}^{n} \frac{\binom{n}{k}}{2^{n-1}}\, L^{d-1}_k$$

Equation (8) brings a natural interpretation of $L^d_n$, which is confirmed by the recurrence: it is just the expected number of depth d leaves of the minimal prefix tree. It means $l^d_n := L^d_n/n$ is the probability that a hash value will correspond to a depth d leaf, so $\sum_{d=0}^{\infty} l^d_n = 1$. We will find an analytic formula for $l^d_n$ later.
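The recurrences of this section can be cross-checked numerically. The sketch below (our helper names; floating point, small n) evaluates $H^d_n$ from (7) and $L^d_n$ from its recurrence, so that relation (8) can be verified for concrete values.

```python
from math import comb, log2

def h(n):
    """Entropy (bits) of the Binomial(n, 1/2) left/right split count."""
    return -sum(comb(n, k) / 2**n * log2(comb(n, k) / 2**n) for k in range(n + 1))

def H_depth(n_max, d_max):
    """Hd[d][n] = H^d_n from recurrence (7); Hd[0] is the minimal prefix tree H_n from (2)."""
    H0 = [0.0, 0.0]
    for n in range(2, n_max + 1):
        s = 2 ** (n - 1) * h(n) + sum(comb(n, k) * H0[k] for k in range(n))
        H0.append(s / (2 ** (n - 1) - 1))
    Hd = [H0]
    for d in range(1, d_max + 1):
        prev = Hd[-1]
        row = [0.0, float(d)]                       # H^d_0 = 0, H^d_1 = d
        for n in range(2, n_max + 1):
            row.append(h(n) + sum(comb(n, k) * prev[k] for k in range(1, n + 1)) / 2 ** (n - 1))
        Hd.append(row)
    return Hd

def L_depth(n_max, d_max):
    """L[d][n] = expected number of depth-d leaves of the minimal prefix tree."""
    L = [[0.0, 1.0] + [0.0] * (n_max - 1)]          # L^0_1 = 1, L^0_n = 0 otherwise
    for d in range(1, d_max + 1):
        prev = L[-1]
        row = [0.0, 0.0]                            # a single sequence is a depth-0 leaf
        for n in range(2, n_max + 1):
            row.append(sum(comb(n, k) * prev[k] for k in range(1, n + 1)) / 2 ** (n - 1))
        L.append(row)
    return L

Hd, L = H_depth(4, 5), L_depth(4, 5)
rhs = Hd[0][4] + sum((5 - i) * L[i][4] for i in range(1, 5))
print(round(Hd[5][4], 3), round(rhs, 3))   # both sides of (8) for d = 5, n = 4
```

Both printed numbers should agree with each other and with the $H^5_4$ entry of Table 1.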



4 Asymptotic behavior and average depth 

Let us denote by $D_n$ the average depth of leaves in a size n minimal prefix tree:

$$D_n = \sum_{d=0}^{\infty} d\, l^d_n \qquad (9)$$

The situation for large d looks like in Fig. 2c - practically none of the leaves get below the boundary. We can see the encoding of such a tree as storing the minimal prefix tree plus on average $d - D_n$ bits per element to encode suffixes, like in this figure. So for large d, $H^d_n$ goes to $H_n + n(d - D_n)$ and they are equal in the limit. It can also be seen from formula (8). An alternative view of encoding this information (without leaves below d) is that it is just encoding n numbers in the $[0, 2^d - 1]$ range - one of $\binom{2^d}{n}$ combinations, which for large d is practically $\frac{(2^d)^n}{n!} = 2^{dn - \lg(n!)}$:

$$\text{for large } d \gg \lg(n): \qquad H^d_n \approx dn - \lg(n!) \qquad (10)$$

The amount of information does not depend on the choice of one of the equivalent representations. Finally, since $H_n + n(d - D_n)$ is asymptotically equal to $dn - \lg(n!)$, we can remove the dependence on d, getting a simple relation:

$$H_n + \lg(n!) = n D_n \qquad (11)$$

This equality can also be seen directly: on both of its sides there is the number of bits required to store the minimal prefix tree including information about the order of leaves.
Let us now find $D_n$ in alternative ways, like through a recurrence. $D_0 = D_1 = 0$,

$$D_n = \sum_{k=0}^{n} \frac{\binom{n}{k}}{2^n}\, \frac{k\left([k>0] + D_k\right) + (n-k)\left([n-k>0] + D_{n-k}\right)}{n} = \sum_{k=1}^{n} \frac{\binom{n}{k}}{2^{n-1}}\, \frac{k}{n}\, (1 + D_k)$$

$$D_n = \frac{2^{n-1} + \sum_{k=1}^{n-1} \binom{n-1}{k-1} D_k}{2^{n-1} - 1} \qquad \text{for } n \ge 2 \qquad (12)$$

where [c] = 1 if condition c is true and 0 otherwise. Using the relation (11) and the Stirling approximation $n! \approx \sqrt{2\pi n}\,(n/e)^n$, we get

$$D_n \approx \tilde{D}_n := \lg(n) + 1.332746156 - \frac{\lg(e)}{2n} + \epsilon \sin(2\pi\lg(n) + \delta) \qquad (13)$$

In the case of a perfect binary tree, where $n = 2^m$ prefixes would take all $[0, 2^m - 1]$ values, the average depth would be exactly m = lg(n). We see that the randomness of the tree increases this average depth asymptotically by 1.332746156 = a - lg(e).
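Relation (11) and the approximation (13) can be checked against the recurrences. The sketch below uses our own helper names; for the non-oscillating part of (13) it assumes the constant $1/2 + \gamma\lg(e)$ and a $\lg(e)/(2n)$ correction, which is what (11) combined with Stirling's formula gives, and it ignores the small oscillating term.

```python
from math import comb, lgamma, log, log2

def h(n):
    """Entropy (bits) of the Binomial(n, 1/2) left/right split count."""
    return -sum(comb(n, k) / 2**n * log2(comb(n, k) / 2**n) for k in range(n + 1))

def tables(n_max):
    """H[n] from recurrence (2) and D[n] from recurrence (12), for n = 0..n_max."""
    H, D = [0.0, 0.0], [0.0, 0.0]
    for n in range(2, n_max + 1):
        denom = 2 ** (n - 1) - 1
        H.append((2 ** (n - 1) * h(n) + sum(comb(n, k) * H[k] for k in range(n))) / denom)
        D.append((2 ** (n - 1) + sum(comb(n - 1, k - 1) * D[k] for k in range(1, n))) / denom)
    return H, D

def lg_factorial(n):
    """lg(n!) via the log-gamma function."""
    return lgamma(n + 1) / log(2)

def depth_approx(n):
    """Assumed non-oscillating part of (13): lg(n) + 1/2 + gamma*lg(e) - lg(e)/(2n)."""
    gamma = 0.5772156649015329          # Euler-Mascheroni constant
    return log2(n) + 0.5 + gamma / log(2) - 1 / (2 * n * log(2))

H, D = tables(50)
for n in (4, 10, 50):
    print(n, round(D[n], 3), round((H[n] + lg_factorial(n)) / n, 3), round(depth_approx(n), 3))
```

The second and third columns should coincide (relation (11)), and the last column should approach them as n grows.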

Let us now find $l^d_n$: the probability that a sequence will correspond to a depth d leaf. Taking a sequence, it corresponds to a depth d leaf if among the other n - 1 sequences there are $k \ge 1$ having the same length d - 1 prefix, but none of them has the same length d prefix:

$$l^d_n = \sum_{k=1}^{n-1} \binom{n-1}{k} (2^{-d})^k (1 - 2^{-d+1})^{n-1-k} = (1 - 2^{-d})^{n-1} - (1 - 2^{-d+1})^{n-1} \qquad (14)$$

We can for example use it to calculate moments, like the average depth we required:

$$D_{n+1} = \sum_{d=1}^{\infty} d\, l^d_{n+1} = \sum_{d=1}^{\infty} d \sum_{k=1}^{n} \binom{n}{k} (-1)^k (1 - 2^k)\, 2^{-dk} = \sum_{k=1}^{n} \binom{n}{k} (-1)^k (1 - 2^k) \sum_{d=1}^{\infty} d\, 2^{-dk}$$

$$= \sum_{k=1}^{n} \binom{n}{k} (-1)^k (1 - 2^k)\, \frac{2^{-k}}{(1 - 2^{-k})^2} = -\sum_{k=1}^{n} \binom{n}{k} \frac{(-1)^k}{1 - 2^{-k}}$$

The sign change makes this form still inconvenient - let us expand it further:

$$D_{n+1} = -\sum_{k=1}^{n} \binom{n}{k} (-1)^k \sum_{d=0}^{\infty} (2^{-k})^d = -\sum_{d=0}^{\infty} \sum_{k=1}^{n} \binom{n}{k} (-2^{-d})^k$$

$$D_{n+1} = \sum_{d=0}^{\infty} 1 - (1 - 2^{-d})^n \qquad (15)$$

This formula can also be understood directly. $(1 - 2^{-d})^n$ is the probability that no sequence will reach some fixed length d prefix - we can for example use it to find the saturation of the possible $2^d$ internal nodes on depth d (reached by at least two sequences): the expected number of depth d internal nodes is

$$2^d \left(1 - (1 - 2^{-d})^n - n\,2^{-d} (1 - 2^{-d})^{n-1}\right)$$

Fixing a sequence, the $1 - (1 - 2^{-d})^n$ term from (15) can be seen as the probability that some other sequence will have the same length d prefix. So to distinguish our sequence, it has to contain the edge corresponding to the succeeding step. Finally, the formula (15) can be seen as the sum of probabilities that the path to the leaf contains the [d, d + 1] edge.
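Formulas (14) and (15) translate directly into code. The sketch below is our own illustration (function names are hypothetical); note that (15) is stated for $D_{n+1}$, so for $D_n$ the exponent becomes n - 1. The infinite sums are truncated, which is harmless because the terms decay like $2^{-d}$.

```python
def leaf_depth_prob(n, d):
    """l^d_n from (14): probability that a given sequence ends in a depth-d leaf (n >= 2, d >= 1)."""
    return (1 - 2.0 ** -d) ** (n - 1) - (1 - 2.0 ** (1 - d)) ** (n - 1)

def average_depth(n, d_max=200):
    """D_n from (15) with n shifted by one: sum over d >= 0 of 1 - (1 - 2^-d)^(n-1)."""
    return sum(1 - (1 - 2.0 ** -d) ** (n - 1) for d in range(d_max + 1))

n = 4
print(round(average_depth(n), 4))                                          # D_4
print(round(sum(d * leaf_depth_prob(n, d) for d in range(1, 200)), 4))     # same value via (9)
print(round(sum(leaf_depth_prob(n, d) for d in range(1, 200)), 4))         # probabilities sum to 1
```

The first two lines agree because (15) is just a resummation of $\sum_d d\, l^d_n$.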

The function $f_n(x) := 1 - (1 - 2^{-x})^n$ is a kind of step-like decreasing function for $x \in [0, \infty)$: nearly 1 up to about lg(n), then it quickly drops to nearly 0 and stays there. This explains the source of the oscillations: the exact sum depends on the position of this drop relative to the discrete lattice of natural numbers. We could try to use the Euler-Maclaurin formula ([1]): change summation into integration.

$$\int_0^{\infty} 1 - (1 - 2^{-x})^n\, dx = \lg(e) \int_0^1 \frac{1 - u^n}{1 - u}\, du = \lg(e) \int_0^1 1 + u + \ldots + u^{n-1}\, du = \lg(e) \sum_{i=1}^{n} \frac{1}{i}$$

where we have used the $u = 1 - 2^{-x}$ substitution. The initial Euler-Maclaurin approximation is

$$D_{n+1} \approx \frac{1}{2} + \lg(e) \sum_{i=1}^{n} \frac{1}{i} \approx \lg(n) + \frac{1}{2} + \gamma \lg(e) \qquad (16)$$

The Euler-Maclaurin formula also requires the sum of a series of derivatives at both boundaries. All derivatives vanish at infinity, however at 0 only those up to the (n - 1)-st are zero ($f_n^{(m)}(0) = -(-1)^{m+n}\, n!\, (\ln(2))^m S(m, n)$ where S(m, n) are Stirling numbers of the second kind). Unfortunately, from Fig. 3 we see that the oscillations explode near 0 (and then decrease to a constant level), making the Euler-Maclaurin series divergent at 0. However, the integral suggests the choice of the basic parameter for $H_n$ and $D_n$:

$$a - \lg(e) = \lim_{n\to\infty} D_n - \lg(n) = \frac{1}{2} + \gamma \lg(e) \approx 1.3327461772768672 \qquad (17)$$
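The constants above and the quality of the integral-based estimate can be checked with a few lines. This is a sketch with our own names; it assumes the first Euler-Maclaurin estimate is the integral $\lg(e)\sum_{i=1}^{n} 1/i$ plus half of the boundary value $f_n(0) = 1$.

```python
from math import e, log2

EULER_GAMMA = 0.5772156649015329   # Euler-Mascheroni constant

def depth_exact(n, d_max=200):
    """D_{n+1} from (15): sum over d >= 0 of 1 - (1 - 2^-d)^n (truncated at d_max)."""
    return sum(1 - (1 - 2.0 ** -d) ** n for d in range(d_max + 1))

def depth_euler_maclaurin(n):
    """Integral plus half of the boundary value: lg(e) * (1 + 1/2 + ... + 1/n) + 1/2."""
    return log2(e) * sum(1 / i for i in range(1, n + 1)) + 0.5

a = 0.5 + (1 + EULER_GAMMA) * log2(e)
print(a)                # average cost of distinguishability, bits per element
print(a - log2(e))      # asymptotic excess of the average depth over lg(n), as in (17)
print(depth_exact(1000) - depth_euler_maclaurin(1000))   # small residual, mostly the oscillation
```

The residual in the last line is not exactly zero; it reflects the oscillating term that the integral cannot capture.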



5 False positive probability and comparison with Bloom filter

We can now calculate the probability of false positives - that a completely random sequence will get to a leaf. Let us denote this probability by $F_n$ for the minimal prefix tree and $F^d_n$ for the minimal depth d tree. Recurrence relations for them are

$$F_0 = 0, \quad F_1 = 1, \quad F_n = \sum_{k=0}^{n} \frac{\binom{n}{k}}{2^n}\, \frac{F_k + F_{n-k}}{2} = \sum_{k=1}^{n} \frac{\binom{n}{k}}{2^n} F_k \quad \Rightarrow \quad F_n = \frac{\sum_{k=1}^{n-1} \binom{n}{k} F_k}{2^n - 1}$$

$$F^0_n = F_n, \qquad F^{d+1}_n = \sum_{k=1}^{n} \frac{\binom{n}{k}}{2^n}\, F^d_k$$

Having the probability of a leaf's depth (14), we can also find direct formulas:

$$F_n = n \sum_{d=1}^{\infty} 2^{-d}\, l^d_n = n \sum_{d=1}^{\infty} 2^{-d} \left((1 - 2^{-d})^{n-1} - (1 - 2^{-d+1})^{n-1}\right) = \frac{n}{2} \sum_{d=1}^{\infty} 2^{-d} (1 - 2^{-d})^{n-1}$$

$$F^d_n = n\, 2^{-d} \sum_{i=0}^{d} l^i_n + n \sum_{i=d+1}^{\infty} 2^{-i}\, l^i_n = \frac{n}{2} \sum_{i=d}^{\infty} 2^{-i} (1 - 2^{-i})^{n-1}$$

We can approximate this sum by an integral (calculated using the $u = 1 - 2^{-x}$ substitution):

$$F_n \approx \frac{n}{2} \int_{x=0}^{\infty} 2^{-x} (1 - 2^{-x})^{n-1}\, dx = \frac{n \lg(e)}{2} \int_{u=0}^{1} u^{n-1}\, du = \frac{\lg(e)}{2} \approx 0.72134752$$

Oscillations of $\sin(2\pi\lg(n))$ type appear as previously - while in a perfect tree all sequences would get to a leaf, for the minimal prefix trees $F_n$ turns out to oscillate between 0.72134 and 0.72136. As expected, for $n \ll 2^d$, $F^d_n$ grows approximately like $n/2^d$, then saturates near $n = 2^d$ and finally oscillates in the above range.
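The false positive probabilities can be evaluated both ways. The sketch below (our helper names) assumes the recurrence solved for $F_n$ as $F_n = \sum_{k=1}^{n-1}\binom{n}{k}F_k/(2^n - 1)$ and the closed sum $F^d_n = \frac{n}{2}\sum_{i \ge d} 2^{-i}(1 - 2^{-i})^{n-1}$, with d = 0 giving $F_n$.

```python
from math import comb, e, log2

ASYMPTOTIC_F = log2(e) / 2      # integral approximation of F_n

def F_recurrence(n_max):
    """F[n] for n = 0..n_max from the recurrence, with F[0] = 0 and F[1] = 1."""
    F = [0.0, 1.0]
    for n in range(2, n_max + 1):
        F.append(sum(comb(n, k) * F[k] for k in range(1, n)) / (2 ** n - 1))
    return F

def F_sum(n, d=0, terms=200):
    """F^d_n = (n/2) * sum_{i >= d} 2^-i * (1 - 2^-i)^(n-1); d = 0 gives F_n."""
    return n / 2 * sum(2.0 ** -i * (1 - 2.0 ** -i) ** (n - 1) for i in range(d, d + terms))

F = F_recurrence(10)
print(round(F[4], 4), round(F_sum(4), 4))          # recurrence vs closed sum
print(round(F_sum(1000), 5), round(ASYMPTOTIC_F, 5))
print(round(F_sum(20, d=9), 3), 20 / 2 ** 9)       # small n / 2^d regime
```

For $n \ll 2^d$ the last line shows $F^d_n$ staying close to $n/2^d$, as stated above.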

Let us compare the required amount of information with the commonly used Bloom filter [1]. In this method we use a length m bit table initialized with zeroes. For each element, k independent hash values from the [1, m] range are calculated - positions in the table. Inserting the element means changing the values at all these k positions to "1". Asking whether a given element is stored means asking if the corresponding k positions are all "1".

As in the presented approach, false negatives are not possible. A false positive appears when all k positions happen to have been set to "1" by some of the n elements - the probability of this situation is

$$\left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k \approx \left(1 - e^{-kn/m}\right)^k$$

For given n, m, this probability is minimized for $k = \frac{m}{n}\ln(2)$. As we should expect, this k corresponds to the case where the bits in all positions of the table have exactly p = 1/2 probability. We assume that they are uncorrelated - in this case the table contains the maximal amount of information ($m\left(-p\lg(p) - (1-p)\lg(1-p)\right) = m$ bits of information) and so cannot be further compressed.
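The optimal number of hash functions can be illustrated numerically. This is a sketch with hypothetical example sizes (n = 1000 elements, m = 10000 bits), using the standard false positive formula quoted above.

```python
from math import log

def bloom_false_positive(n, m, k):
    """(1 - (1 - 1/m)^(k n))^k: all k probed bits were set by some of the n elements."""
    return (1 - (1 - 1 / m) ** (k * n)) ** k

def optimal_k(n, m):
    """Real-valued minimizer k = (m / n) * ln(2); in practice it is rounded to an integer."""
    return m / n * log(2)

n, m = 1000, 10000                      # hypothetical sizes
k_opt = optimal_k(n, m)
print(round(k_opt, 3))
for k in (5, 7, 9):
    print(k, bloom_false_positive(n, m, k))
print(2 ** -k_opt)                      # (1/2)^k at the optimum, for comparison
```

The integer k nearest to `k_opt` gives the smallest false positive probability among the printed choices, close to $2^{-k}$ at the optimum.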

This optimal choice of k leads to the false positive probability $p_f = 2^{-k} = e^{-\frac{m}{n}(\ln 2)^2}$. So for a chosen $p_f$, the optimally chosen k means the Bloom filter memory requirement is

$$(m =)\quad B_n(p_f) := -\frac{n \lg(p_f)}{\ln(2)} \quad \text{bits of information} \qquad (18)$$

which cannot be compressed further.

To compare with the presented approach, small $p_f$ is approximately $n/2^d$, so we should choose $d = \lg(n/p_f)$. For large d, $H^d_n \approx dn - \lg(n!)$, so in the analogous situation we need approximately

$$H^{\lg(n/p_f)}_n \approx n\lg(n/p_f) - \lg(n!) \approx -n\lg(p_f) + n\lg(e) \qquad (19)$$

so asymptotically the ability to save lg(n!) bits of information about the order of hash values alone allows memory requirements to be reduced to as little as ln(2) ≈ 0.693 of those of the Bloom filter.
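The comparison of (18) and (19) can be made concrete. The sketch below uses our own function names and the large-d estimate $dn - \lg(n!)$ for the prefix tree, so it is only meaningful when d is well above lg(n).

```python
from math import lgamma, log, log2

def bloom_bits(n, p_f):
    """B_n(p_f) = -n * lg(p_f) / ln(2), eq. (18)."""
    return -n * log2(p_f) / log(2)

def prefix_tree_bits(n, p_f):
    """Large-d estimate d*n - lg(n!) with d = lg(n / p_f), eq. (19)."""
    d = log2(n / p_f)
    return d * n - lgamma(n + 1) / log(2)

# n = 20 elements, d = 30: compare with the corresponding Table 1 entries
n, p_f = 20, 20 / 2 ** 30
print(round(bloom_bits(n, p_f), 1), round(prefix_tree_bits(n, p_f), 1))

# the ratio approaches ln(2) ~ 0.693 as the false positive probability shrinks
for exponent in (20, 40, 80):
    p = 2.0 ** -exponent
    print(exponent, round(prefix_tree_bits(10 ** 6, p) / bloom_bits(10 ** 6, p), 4))
```

The ratio stays above ln(2) for any finite $p_f$ because of the additional $n\lg(e)$ term in (19).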

Table 1 contains a comparison of the memory requirements of both methods - the Bloom filter is better only for small d: when the prefix tree is not intended to provide a small false positive probability. In this case, in contrast to the Bloom filter, it additionally contains the information needed to distinguish all elements, for example to attach some additional information to them.

The large false positive probability is no longer a disadvantage for cryptographic purposes. On the contrary - sometimes we would like to send/store as little as possible in case it is obtained by a third party - by requiring the use of shared secret information to make use of the message. The presented approach can be used for example when we share the same database and we would like to transmit some additional properties of its elements - frequent false positives would make the message useless to anyone who does not have the original database. The next section shows how to improve it further to the optimum.



6 Reduced minimal prefix tree 

The fact that the minimal prefix tree still has some ability to exclude elements from the set (the false positive probability is less than 1) suggests that the informational content can be further reduced when we are certain of working inside the set. The idea was pointed out to me by James Dow Allen: for internal nodes of degree 1, all sequences from the set choose the same direction, so we do not need to encode this direction - saving 1 bit of information per such node. These degree 1 nodes gave the minimal prefix tree the ability to sometimes (with about 1 - 0.721 probability) recognize that an element is not from the set - the presented reduction decreases this probability to 0 (the false positive probability grows to 1).

The average informational content of such an n leaf reduced tree ($H'_n$) can be calculated as for the minimal prefix tree, but using a slightly smaller cost of encoding a single step ($h'_n < h_n$). The reduced tree can be encoded in practice in the corresponding way. This time, for a node through which n sequences go, the two boundary possibilities of $2^{-n}$ probability, that all these sequences make the succeeding step left/right, are merged into one situation of $2^{-n+1}$ probability - that the degree of this node is 1:

$$h'_n = h_n - 2^{-n+1}$$

In other words, we save 1 bit per each degree 1 node that appears, so on average we save $2^{-n+1}$ bits per each node through which n sequences go. Now $H'_n$ can be calculated using the recurrence (2) for $H_n$, but with $h_n$ replaced by $h'_n$. Knowing that due to the reduction we save $2^{-m+1}$ bits for each node through which $m \le n$ sequences go allows us to find a direct formula for $H_n - H'_n$. The expected number of such nodes is the sum of their expected number on depth d:

$$N^m_n = [n = m] + \sum_{d=1}^{\infty} 2^d \binom{n}{m} 2^{-dm} (1 - 2^{-d})^{n-m}$$

$$H_n - H'_n = \sum_{m=2}^{n} 2^{-m+1} N^m_n = 2^{-n+1} + \sum_{d=1}^{\infty} \sum_{m=2}^{n} 2^{d-m+1} \binom{n}{m} 2^{-dm} (1 - 2^{-d})^{n-m} \qquad (20)$$

It is also the expected number of degree 1 nodes. Using the fact that the number of degree 2 nodes is n - 1, we can find a simpler formula. So let us calculate the expected number of all nodes and then subtract the expected number of the remaining nodes.

The expected number of nodes on depth up to k is $\sum_{d=0}^{k} 2^d \left(1 - (1 - 2^{-d})^n\right)$. This number includes expansions of leaves - there are on average $n(k - D_n)$ of them for large k. Subtracting them and taking the $k \to \infty$ limit, the expected number of degree 1 nodes is:

$$n D_n - \sum_{d=0}^{\infty} \left(n - 2^d \left(1 - (1 - 2^{-d})^n\right)\right) - (n - 1) \qquad (21)$$

where the subtracted n - 1 term is the number of degree 2 nodes.

Let us approximate the above sum with an integral, using the $u = 1 - 2^{-x}$ substitution as previously:

$$\int_0^{\infty} n - 2^x \left(1 - (1 - 2^{-x})^n\right) dx = \lg(e) \int_0^1 \frac{n - \frac{1 - u^n}{1 - u}}{1 - u}\, du = \lg(e) \int_0^1 \sum_{k=0}^{n-1} \frac{1 - u^k}{1 - u}\, du = \lg(e) \int_0^1 \sum_{i=1}^{n-1} (n - i)\, u^{i-1}\, du = \lg(e) \sum_{i=1}^{n-1} \frac{n - i}{i}$$

where we have used $\frac{1 - u^k}{1 - u} = \sum_{i=0}^{k-1} u^i$ twice, obtaining a good approximation:

$$\sum_{d=0}^{\infty} \left(n - 2^d \left(1 - (1 - 2^{-d})^n\right)\right) \approx \frac{n}{2} + \lg(e) \left(n \sum_{i=1}^{n-1} \frac{1}{i} - n + 1\right) - \lg(e/2)$$

The difference for this approximation is a small term oscillating like $\sin(2\pi\lg(n) - 0.6)$. Now substituting $D_n \approx \frac{1}{2} + \lg(e) \sum_{i=1}^{n-1} \frac{1}{i}$ into (21), we get

$$n D_n - \sum_{d=0}^{\infty} \left(n - 2^d \left(1 - (1 - 2^{-d})^n\right)\right) - (n - 1) \approx n \lg(e/2) \qquad (22)$$

So there are approximately $n\lg(e/2) \approx 0.442695\,n$ degree 1 nodes in an n leaf minimal prefix tree - the asymptotic expected number of bits per element required to encode such a reduced tree is approximately $a - \lg(e/2) = 3/2 + \gamma\lg(e) \approx 2.332746177$.
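The saving of the reduced tree can be verified with the same recurrence as for $H_n$. The sketch below is our own illustration: it assumes the per-node cost of the reduced tree is $h_n - 2^{-n+1}$ and compares the saving per element with $\lg(e/2)$.

```python
from math import comb, e, log2

DEGREE_ONE_RATE = log2(e / 2)    # asymptotic number of degree 1 nodes per element

def h(n):
    """Entropy (bits) of the Binomial(n, 1/2) left/right split count."""
    return -sum(comb(n, k) / 2**n * log2(comb(n, k) / 2**n) for k in range(n + 1))

def tree_entropy(n_max, reduced=False):
    """H[n] from recurrence (2); with reduced=True the step cost is h(n) - 2^(1-n)."""
    H = [0.0, 0.0]
    for n in range(2, n_max + 1):
        step = h(n) - (2.0 ** (1 - n) if reduced else 0.0)
        H.append((2 ** (n - 1) * step + sum(comb(n, k) * H[k] for k in range(n))) / (2 ** (n - 1) - 1))
    return H

H = tree_entropy(50)
Hr = tree_entropy(50, reduced=True)
print(round(Hr[4], 3))                                        # reduced tree cost for n = 4
print(round((H[50] - Hr[50]) / 50, 4), round(DEGREE_ONE_RATE, 4))
```

The difference `H[n] - Hr[n]` is also the expected number of degree 1 nodes, since exactly one bit is saved at each of them.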



7 Conclusions 

The presented analysis shows expected values and theoretical bounds for naturally appearing prefix trees. A practical approach can easily reach these limits, for example in various database applications. At this moment it could be used to optimally compress these data for transmission or storage purposes, but knowing these bounds alone should motivate the search for online processing methods with more optimal memory usage.

If a small false positive probability is required, this approach requires asymptotically about 0.693 of the memory used by a Bloom filter. On the other hand, storing only the minimal prefix tree may have different applications, like classification (e.g. verb/noun) while being certain of working within some fixed dictionary. The ability to distinguish elements allows some information to be attached to them, at the price of at least 2.33275 additional bits per element. The false positive probability equal to 1 of such minimal sent/stored information can be seen as an additional desired property for cryptographic applications.

Another possible application is to optimally store an unordered sequence of n numbers - saving lg(n!) bits of information about their order. If these numbers densely cover e.g. some length m range, we can create a length m bit table and mark their positions - optimal compression of such uncorrelated numbers would require $\lg\binom{m}{n} \approx m\,h(n/m)$ bits of information, where $h(p) = -p\lg(p) - (1-p)\lg(1-p)$. However, a problem appears when m is very large - in this case it might be more convenient to compress the prefix tree of these numbers as considered here: first encode the distribution between the left and right halves of the range, then recursively go into these subranges.
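The bit-table estimate from the last paragraph can be checked for concrete sizes. This is a sketch with hypothetical values (n = 1024 numbers in a range of length m = 4096); the names are ours.

```python
from math import comb, log2

def binary_entropy(p):
    """h(p) = -p lg(p) - (1 - p) lg(1 - p), in bits."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

m, n = 4096, 1024                       # hypothetical range length and number of elements
exact = log2(comb(m, n))                # bits to choose one n-element subset of the range
approx = m * binary_entropy(n / m)      # entropy estimate for the marked bit table
ordered = n * log2(m)                   # storing the numbers as an ordered list

print(round(exact, 1), round(approx, 1), round(ordered, 1))
```

The entropy estimate slightly exceeds the exact subset count, and both are far below the cost of the ordered list, the gap being roughly lg(n!) plus the effect of the range being densely covered.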
Table 1: Values of considered functions for some parameters. H is informational content in bits, D average depth, F probability of false positives and B required bits of information using Bloom filter for analogous parameters. Storing the $H^d_n$ cases in a standard way (n length d sequences) would require n*d bits, while encoding the tree saves about lg(n!) bits by not choosing their ordering.

n              1        2        3        4        5        6        7        8        9        10
h'_n           0        1        1.561    1.906    2.136    2.302    2.431    2.536    2.626    2.704
h_n            1        1.5      1.811    2.031    2.198    2.333    2.447    2.544    2.630    2.706
\tilde h_n     1.047    1.547    1.840    2.047    2.208    2.340    2.451    2.547    2.632    2.708
H'_n           0        2        4.082    6.224    8.407    10.62    12.84    15.08    17.34    19.60
H_n            0        3        5.415    7.986    10.62    13.27    15.94    18.63    21.32    24.02
\tilde H_n     0.728    3.004    5.487    8.055    10.67    13.31    15.98    18.66    21.35    24.05
H^5_n          5        9.125    12.79    16.15    19.30    22.31    25.19    27.99    30.72    33.39
H^9_n          9        17.00    24.44    31.46    38.17    44.62    50.86    56.91    62.81    68.56
H^10_n         10       19.00    27.43    35.44    43.13    50.57    57.78    64.81    71.67    78.38
H^15_n         15       29.00    42.42    55.42    68.09    80.51    92.70    104.7    116.5    128.2
H^20_n         20       39.00    57.46    75.42    93.09    110.5    127.7    144.7    161.5    178.2
lg(n!)         0        1        2.585    4.585    6.907    9.492    12.30    15.30    18.47    21.79
D_n            0        2        2.667    3.143    3.505    3.794    4.035    4.241    4.421    4.581
F_n            1        0.667    0.714    0.724    0.724    0.722    0.721    0.721    0.721    0.721

n              20       50       100      200      500      1000     2000     5000     10000    100000
h_n            3.207    3.869    4.369    4.869    5.530    6.030    6.530    7.191    7.691    9.352
H'_n           42.45    111.8    227.9    460.7    1160     2327     4658     11656    23319    233264
H_n            51.29    133.9    272.2    549.2    1381     2768     5543     13869    27746    277534
H^5_n          58.82    136.1    272.3    549.2    1381     2768     5543     13869    27746    277534
H^9_n          120.4    245.1    411.6    692.1    1462     2789     5544     13869    27746    277534
H^10_n         139.7    290.5    494.0    827.6    1651     2931     5584     13869    27746    277534
H^15_n         238.9    535.9    975.8    1757     3748     6531     11186    22219    37075    277758
B_n(F^15_n)    308.1    675.0    1206     2124     4363     7305     11809    20608    28813    69971
H^20_n         338.9    785.8    1475     2755     6233     11473    20956    45815    81732    501779
B_n(F^20_n)    452.4    1036     1927     3565     7960     14478    26073    55666    96970    502238
H^30_n         538.9    1286     2475     4755     11233    21471    40947    95768    181542   1483314
B_n(F^30_n)    740.9    1757     3370     6451     15173    28903    54921    127767   241107   1931833
lg(n!)         61.08    214.2    524.8    1245     3767     8529     19053    54233    118458   1516704
D_n            5.62     6.962    7.969    8.973    10.30    11.30    12.30    13.62    14.62    17.94
F^9_n          0.038    0.092    0.172    0.304    0.542    0.681    0.720    0.721    0.721    0.721
F^10_n         0.019    0.047    0.092    0.172    0.358    0.542    0.680    0.721    0.721    0.721
10^2 F^15_n    0.061    0.152    0.305    0.608    1.510    2.991    5.862    13.80    25.05    71.45
10^5 F^20_n    1.907    4.768    9.536    19.07    47.67    95.31    190.5    475.3    947.6    8954
10^8 F^30_n    1.863    4.657    9.313    18.63    46.57    93.13    186.3    465.7    931.3    9313
